This notebook details an implementation of the natural language inference model presented in (Parikh et al, 2016). The model is notable for the small number of parameters and hyperparameters it specifies, while still yielding good performance.
In [1]:
import spacy
import numpy as np
We only need the GloVe vectors from spaCy, not a full NLP pipeline.
In [2]:
nlp = spacy.load('en_vectors_web_lg')
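As a quick check that the vectors are loaded (a sketch; 'apple' is just an arbitrary in-vocabulary word):
print(nlp.vocab.vectors_length)        # dimensionality of the GloVe vectors (300)
print(nlp.vocab['apple'].has_vector)   # True for an in-vocabulary word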
Function to load the SNLI dataset. The class labels are converted to a one-hot representation. The function is adapted from a spaCy example.
In [3]:
import json
from keras.utils import to_categorical
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
def read_snli(path):
    texts1 = []
    texts2 = []
    labels = []
    with open(path, 'r') as file_:
        for line in file_:
            eg = json.loads(line)
            label = eg['gold_label']
            if label == '-':  # per Parikh et al., skip SNLI entries where the annotators did not agree (gold label '-')
                continue
            texts1.append(eg['sentence1'])
            texts2.append(eg['sentence2'])
            labels.append(LABELS[label])
    return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))
We load the SNLI training triples from the training file; the test set, loaded below, will be used as validation data.
In [8]:
texts, hypotheses, labels = read_snli('snli/snli_1.0_train.jsonl')
In [9]:
def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors=True):
    sents = texts + hypotheses
    # the extra +1 is for a zero vector representing NULL, used for padding
    num_vectors = max(lex.rank for lex in nlp.vocab) + 2
    # create random vectors for OOV tokens
    oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))
    oov = oov / oov.sum(axis=1, keepdims=True)
    vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')
    vectors[num_vectors:, :] = oov
    for lex in nlp.vocab:
        if lex.has_vector and lex.vector_norm > 0:
            vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors else lex.vector
    sents_as_ids = []
    for sent in sents:
        doc = nlp(sent)
        word_ids = []
        for i, token in enumerate(doc):
            # skip odd spaces from the tokenizer
            if token.has_vector and token.vector_norm == 0:
                continue
            if i > max_length:
                break
            if token.has_vector:
                word_ids.append(token.rank + 1)
            else:
                # if we don't have a vector, pick an OOV entry
                word_ids.append(token.rank % num_oov + num_vectors)
        # there must be a simpler way of generating padded arrays from lists...
        word_id_vec = np.zeros(max_length, dtype='int')
        clipped_len = min(max_length, len(word_ids))
        word_id_vec[:clipped_len] = word_ids[:clipped_len]
        sents_as_ids.append(word_id_vec)
    return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])
In [10]:
sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)
In [11]:
texts_test, hypotheses_test, labels_test = read_snli('snli/snli_1.0_test.jsonl')
In [12]:
_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)
We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token.
OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).
Note that we will clip sentences to 50 words maximum.
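As a small illustration of how OOV tokens are handled (a sketch, assuming the cells above have run; the nonsense word is presumed to be out of vocabulary):
doc = nlp('A brown dog chased the xyzzyplugh')
for token in doc:
    if token.has_vector and token.vector_norm > 0:
        print(token.text, '-> GloVe row', token.rank + 1)
    else:
        print(token.text, '-> one of the random OOV vectors')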
In [13]:
from keras import layers, Model, models
from keras import backend as K
The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix.
In [14]:
def create_embedding(vectors, max_length, projected_dim):
    return models.Sequential([
        layers.Embedding(
            vectors.shape[0],
            vectors.shape[1],
            input_length=max_length,
            weights=[vectors],
            trainable=False),
        layers.TimeDistributed(
            layers.Dense(projected_dim,
                         activation=None,
                         use_bias=False))
    ])
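As a quick shape check (a sketch, assuming sem_vectors from the earlier cell is still in scope):
embed = create_embedding(sem_vectors, 50, 200)
dummy_ids = np.zeros((1, 50), dtype='int32')
print(embed.predict(dummy_ids).shape)  # expected: (1, 50, 200), i.e. 50 tokens projected to 200 dimensions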
The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers.
In [15]:
def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):
    return models.Sequential([
        layers.Dense(num_units, activation=activation),
        layers.Dropout(dropout_rate),
        layers.Dense(num_units, activation=activation),
        layers.Dropout(dropout_rate)
    ])
The basic idea of the (Parikh et al, 2016) model is to:
1. Attend: softly align the words of the two sentences with an attention mechanism, producing for each word of one sentence a weighted combination of the other sentence's word vectors.
2. Compare: pass each word vector, concatenated with its aligned counterpart, through a feedforward network to produce comparison vectors.
3. Aggregate: sum the comparison vectors for each sentence, concatenate the two sums, and classify the result with a final feedforward network.
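In equations (a sketch, with notation loosely following the paper: $a_i$ and $b_j$ are the projected vectors of the text and hypothesis words, and $F$, $G$, $H$ are instances of the feedforward block defined above):
$$e_{ij} = F(a_i)^\top F(b_j), \qquad \beta_i = \sum_j \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}\, b_j, \qquad \alpha_j = \sum_i \frac{\exp(e_{ij})}{\sum_k \exp(e_{kj})}\, a_i$$
$$v_{1,i} = G([a_i; \beta_i]), \qquad v_{2,j} = G([b_j; \alpha_j]), \qquad \hat{y} = \operatorname{softmax}\Bigl(W\, H\bigl(\bigl[\textstyle\sum_i v_{1,i};\ \sum_j v_{2,j}\bigr]\bigr)\Bigr)$$
where $W$ stands for the final classification layer.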
Note that because entailment is not symmetric (roughly, everything the consequent asserts must already be contained in the antecedent), it is not obvious that we need both aligned vectors in step (3). It may be enough to just use the hypothesis->text vector. We will explore this possibility later.
We need a couple of small helper functions for Lambda layers: one to normalize the attention weights and one to sum the word representations:
In [16]:
def normalizer(axis):
    def _normalize(att_weights):
        exp_weights = K.exp(att_weights)
        sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
        return exp_weights / sum_weights
    return _normalize

def sum_word(x):
    return K.sum(x, axis=1)
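A quick sanity check that normalizer behaves like a softmax over the chosen axis (a sketch; it assumes the Keras backend can evaluate constant tensors via K.eval):
att = K.constant(np.random.rand(2, 4, 3))
normed = normalizer(1)(att)
print(K.eval(K.sum(normed, axis=1)))  # every entry should be ~1.0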
In [17]:
def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):
    input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')
    input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')

    # embeddings (projected)
    embed = create_embedding(vectors, max_length, projected_dim)
    a = embed(input1)
    b = embed(input2)

    # step 1: attend
    F = create_feedforward(num_hidden)
    att_weights = layers.dot([F(a), F(b)], axes=-1)

    G = create_feedforward(num_hidden)

    if entail_dir == 'both':
        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
        alpha = layers.dot([norm_weights_a, a], axes=1)
        beta = layers.dot([norm_weights_b, b], axes=1)

        # step 2: compare
        comp1 = layers.concatenate([a, beta])
        comp2 = layers.concatenate([b, alpha])
        v1 = layers.TimeDistributed(G)(comp1)
        v2 = layers.TimeDistributed(G)(comp2)

        # step 3: aggregate
        v1_sum = layers.Lambda(sum_word)(v1)
        v2_sum = layers.Lambda(sum_word)(v2)
        concat = layers.concatenate([v1_sum, v2_sum])
    elif entail_dir == 'left':
        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
        alpha = layers.dot([norm_weights_a, a], axes=1)
        comp2 = layers.concatenate([b, alpha])
        v2 = layers.TimeDistributed(G)(comp2)
        v2_sum = layers.Lambda(sum_word)(v2)
        concat = v2_sum
    else:
        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
        beta = layers.dot([norm_weights_b, b], axes=1)
        comp1 = layers.concatenate([a, beta])
        v1 = layers.TimeDistributed(G)(comp1)
        v1_sum = layers.Lambda(sum_word)(v1)
        concat = v1_sum

    H = create_feedforward(num_hidden)
    out = H(concat)
    out = layers.Dense(num_classes, activation='softmax')(out)

    model = Model([input1, input2], out)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
In [18]:
K.clear_session()
m = build_model(sem_vectors, 50, 200, 3, 200)
m.summary()
The number of trainable parameters, ~381k, matches the number given by Parikh et al, so we're on the right track.
Parikh et al use tiny batches of 4, training for 50 million batches, which amounts to around 500 epochs. Here, for the purposes of this experiment, we'll use large batches to make better use of the GPU, and train for far fewer epochs.
In [19]:
m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50, validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))
Out[19]:
The result is broadly in the region reported by Parikh et al: ~86% vs. 86.3%. The small difference might be accounted for by differences in max_length (here set to 50) and in the training regime.
It was suggested earlier that, given the semantics of entailment, the vector representing the strength of association from the hypothesis to the text might be all that is needed to classify the entailment.
The following model removes the complementary (text-to-hypothesis) vector from the computation. This decreases the parameter count slightly, because the final dense layers are smaller, and speeds up the forward pass when predicting, because fewer calculations are needed.
In [20]:
m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')
m1.summary()
The parameter count has indeed decreased by 40,000: the first dense layer of the H block now takes a 200-dimensional input instead of a 400-dimensional one, i.e. 200 × 200 = 40,000 fewer weights.
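As a quick check on that arithmetic (a sketch, assuming both m and m1 are still in memory):
print(m.count_params() - m1.count_params())  # expected: 40000 = (400 - 200) * 200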
In [21]:
m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50, validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))
Out[21]:
This model performs essentially the same as the slightly more complex model that evaluates alignments in both directions. Note also that prediction is faster: processing time drops from 64 to 48 microseconds per step.
Let's now look at the asymmetric model in the other direction, one that evaluates text-to-hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the examples, but not as accurately as the previous two.
We'll just use 10 epochs for expediency.
In [96]:
m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')
m2.summary()
In [97]:
m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10, validation_split=.2)
Out[97]:
Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10% lower.
It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!